
OFFICIAL SENSITIVE
This template demonstrates how to develop and share interactive reproducible analysis using Python and Jupyter Lab.
The data used in this template is for learning purposes only.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import hvplot.pandas
import datetime
import re
import locale
from locale import atof
exam_performance = pd.read_csv("data/students-performance.csv")
students = pd.read_csv("data/total-students.csv")
exam_performance.head(3)
| gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | some college | free/reduced | completed | 71 | 85 | 83 |
| 1 | male | group B | some high school | standard | none | 41 | 40 | 34 |
| 2 | male | group C | high school | standard | none | 66 | 52 | 54 |
The exam performance data comprises 5,000 records (rows) and 8 features (columns).
exam_performance.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 5000 non-null object 1 race/ethnicity 5000 non-null object 2 parental level of education 5000 non-null object 3 lunch 5000 non-null object 4 test preparation course 5000 non-null object 5 math score 5000 non-null int64 6 reading score 5000 non-null int64 7 writing score 5000 non-null int64 dtypes: int64(3), object(5) memory usage: 312.6+ KB
students.head(3)
| date | aged 18 to 24 in full-time education (thousands) | |
|---|---|---|
| 0 | Mar-May 1992 | 984 |
| 1 | Apr-Jun 1992 | 999 |
| 2 | May-Jul 1992 | 1,012 |
students.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 291 entries, 0 to 290 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 291 non-null object 1 aged 18 to 24 in full-time education (thousands) 291 non-null object dtypes: object(2) memory usage: 4.7+ KB
Transform the 18 to 24 student population from string 1,000 format to decimal 1000.00 using ASCII to float (atof).
locale.setlocale(locale.LC_NUMERIC, '')
students[["aged 18 to 24 in full-time education (thousands)"]].applymap(atof)
students["aged 18 to 24 in full-time education (thousands)"] = students["aged 18 to 24 in full-time education (thousands)"].map(atof)
students["aged 18 to 24 in full-time education"] = students["aged 18 to 24 in full-time education (thousands)"] * 1000
Add year and month columns to 18-24 student population data.
def add_year_column_to_students():
search = []
for values in students["date"]:
search.append(re.search(r'(\d{4})', values).group())
students["year"] = search
students["year"] = pd.to_datetime(students["year"]).dt.year
add_year_column_to_students()
def add_month_column_to_students():
students["month"] = students["date"].str.split(" ")[0][0]
add_month_column_to_students()
students.groupby("year").mean().hvplot(x='year', y='aged 18 to 24 in full-time education (thousands)') \
* students.groupby("year").mean().hvplot.scatter(x='year', y='aged 18 to 24 in full-time education (thousands)')
fig, ax = plt.subplots(figsize=(15,6))
sns.lineplot(
x="year",
y="aged 18 to 24 in full-time education (thousands)",
data=students
)
plt.xticks(rotation=15)
plt.title("Young people aged 18 to 24 in full-time education, seasonally adjusted, UK," +
" March to May 1992 to May to July 2016")
plt.show()
Student numbers (aged 18-24) have almost doubled since 1992. In 2016, the student population was around 1,896,000
students[["year", "aged 18 to 24 in full-time education"]] \
.groupby("year") \
.mean() \
.astype(int) \
.style.format({
"aged 18 to 24 in full-time education": "{:,d}"
})
| aged 18 to 24 in full-time education | |
|---|---|
| year | |
| 1992 | 1,034,500 |
| 1993 | 1,043,750 |
| 1994 | 1,117,750 |
| 1995 | 1,141,416 |
| 1996 | 1,133,166 |
| 1997 | 1,169,416 |
| 1998 | 1,231,416 |
| 1999 | 1,259,916 |
| 2000 | 1,257,750 |
| 2001 | 1,293,500 |
| 2002 | 1,301,333 |
| 2003 | 1,396,916 |
| 2004 | 1,420,666 |
| 2005 | 1,469,416 |
| 2006 | 1,487,916 |
| 2007 | 1,527,833 |
| 2008 | 1,609,250 |
| 2009 | 1,706,916 |
| 2010 | 1,825,500 |
| 2011 | 1,849,583 |
| 2012 | 1,880,666 |
| 2013 | 1,877,083 |
| 2014 | 1,872,833 |
| 2015 | 1,887,583 |
| 2016 | 1,896,571 |